Portability	GHC
Stability	experimental
Maintainer	bos@serpentine.com

Data.Text.ICU

Contents

Data representation
Types
Boundary analysis
Case mapping
Iteration
Normalization
String comparison
- Normalization-sensitive string comparison
- Locale-sensitive string collation
Regular expressions

Description

Commonly used functions for Unicode, implemented as bindings to the International Components for Unicode (ICU) libraries.

This module contains only the most commonly used types and functions. Other modules in this package expose richer interfaces.

Synopsis

Data representation

The Haskell Text type is implemented as an array in the Haskell heap. This means that its location is not pinned; it may be copied during a garbage collection pass. ICU, on the other hand, works with strings that are allocated in the normal system heap and have a fixed address.

To accommodate this need, these bindings use the functions from Data.Text.Foreign to copy data between the Haskell heap and the system heap. The copied strings are still managed automatically, but the need to duplicate data does add some performance and memory overhead.

Types

data LocaleName Source

The name of a locale.

Constructors

Root	The root locale. For a description of resource bundles and the root resource, see http://userguide.icu-project.org/locale/resources.
Locale String	A specific locale.
Current	The program's current locale.

Instances

Eq LocaleName
Ord LocaleName
Read LocaleName
Show LocaleName
IsString LocaleName

Boundary analysis

Text boundary analysis is the process of locating linguistic boundaries while formatting and handling text. Examples of this process include:

Locating appropriate points to word-wrap text to fit within specific margins while displaying or printing.
Counting characters, words, sentences, or paragraphs.
Making a list of the unique words in a document.
Figuring out if a given range of text contains only whole words.
Capitalizing the first letter of each word.
Locating a particular unit of the text (For example, finding the third word in the document).

The Breaker type was designed to support these kinds of tasks.

For the impure boundary analysis API (which is richer, but less easy to use than the pure API), see the Data.Text.ICU.Break module. The impure API supports some uses that may be less efficient via the pure API, including:

Locating the beginning of a word that the user has selected.
Determining how far to move the text cursor when the user hits an arrow key (Some characters require more than one position in the text store and some characters in the text store do not display at all).

data Breaker a Source

A boundary analyser.

data Break a Source

A break in a string.

Instances

Eq a => Eq (Break a)
Show a => Show (Break a)

brkPrefix :: Break a -> Text Source

Prefix of the current break.

brkBreak :: Break a -> Text Source

Text of the current break.

brkSuffix :: Break a -> Text Source

Suffix of the current break.

brkStatus :: Break a -> aSource

Status of the current break (only meaningful if Line or Word).

data Line Source

Line break status.

Constructors

Soft	A soft line break is a position at which a line break is acceptable, but not required.
Hard

Instances

Enum Line
Eq Line
Show Line

data Word Source

Word break status.

Constructors

Uncategorized	A "word" that does not fit into another category. Includes spaces and most punctuation.
Number	A word that appears to be a number.
Letter	A word containing letters, excluding hiragana, katakana or ideographic characters.
Kana	A word containing kana characters.
Ideograph	A word containing ideographic characters.

Instances

Enum Word
Eq Word
Show Word

breakCharacter :: LocaleName -> Breaker ()Source

Break a string on character boundaries.

Character boundary analysis identifies the boundaries of Extended Grapheme Clusters, which are groupings of codepoints that should be treated as character-like units for many text operations. Please see Unicode Standard Annex #29, Unicode Text Segmentation, http://www.unicode.org/reports/tr29/ for additional information on grapheme clusters and guidelines on their use.

breakLine :: LocaleName -> Breaker Line Source

Break a string on line boundaries.

Line boundary analysis determines where a text string can be broken when line wrapping. The mechanism correctly handles punctuation and hyphenated words.

breakSentence :: LocaleName -> Breaker ()Source

Break a string on sentence boundaries.

Sentence boundary analysis allows selection with correct interpretation of periods within numbers and abbreviations, and trailing punctuation marks such as quotation marks and parentheses.

breakWord :: LocaleName -> Breaker Word Source

Break a string on word boundaries.

Word boundary analysis is used by search and replace functions, as well as within text editing applications that allow the user to select words with a double click. Word selection provides correct interpretation of punctuation marks within and following words. Characters that are not part of a word, such as symbols or punctuation marks, have word breaks on both sides.

breaks :: Breaker a -> Text -> [Break a]Source

Return a list of all breaks in a string, from left to right.

breaksRight :: Breaker a -> Text -> [Break a]Source

Return a list of all breaks in a string, from right to left.

Case mapping

toCaseFold Source

Arguments

:: Bool	Whether to include or exclude mappings for dotted and dotless I and i that are marked with `I` in `CaseFolding.txt`.
-> Text
-> Text

Case-fold the characters in a string.

Case folding is locale independent and not context sensitive, but there is an option for treating the letter I specially for Turkic languages. The result may be longer or shorter than the original.

toLower :: LocaleName -> Text -> Text Source

Lowercase the characters in a string.

Casing is locale dependent and context sensitive. The result may be longer or shorter than the original.

toUpper :: LocaleName -> Text -> Text Source

Uppercase the characters in a string.

Casing is locale dependent and context sensitive. The result may be longer or shorter than the original.

Iteration

data CharIterator Source

A type that supports efficient iteration over Unicode characters.

As an example of where this may be useful, a function using this type may be able to iterate over a UTF-8 ByteString directly, rather than first copying and converting it to an intermediate form. This type also allows e.g. comparison between Text and ByteString, with minimal overhead.

Instances

Eq CharIterator
Ord CharIterator
Show CharIterator

fromString :: String -> CharIterator Source

Construct a CharIterator from a Unicode string.

fromText :: Text -> CharIterator Source

Construct a CharIterator from a Unicode string.

fromUtf8 :: ByteString -> CharIterator Source

Construct a CharIterator from a Unicode string encoded as a UTF-8 ByteString.

Normalization

data NormalizationMode Source

Normalization modes.

Constructors

None	No decomposition/composition.
NFD	Canonical decomposition.
NFKD	Compatibility decomposition.
NFC	Canonical decomposition followed by canonical composition.
NFKC	Compatibility decomposition followed by canonical composition.
FCD	"Fast C or D" form.

Instances

Enum NormalizationMode
Eq NormalizationMode
Show NormalizationMode
Typeable NormalizationMode

normalize :: NormalizationMode -> Text -> Text Source

Normalize a string according the specified normalization mode.

quickCheck :: NormalizationMode -> Text -> Maybe Bool Source

Perform an efficient check on a string, to quickly determine if the string is in a particular normalization form.

A Nothing result indicates that a definite answer could not be determined quickly, and a more thorough check is required, e.g. with isNormalized. The user may have to convert the string to its normalized form and compare the results.

A result of Just True or Just False indicates that the string definitely is, or is not, in the given normalization form.

isNormalized :: NormalizationMode -> Text -> Bool Source

Indicate whether a string is in a given normalization form.

Unlike quickCheck, this function returns a definitive result. For NFD, NFKD, and FCD normalization forms, both functions work in exactly the same ways. For NFC and NFKC forms, where quickCheck may return Nothing, this function will perform further tests to arrive at a definitive result.

String comparison

Normalization-sensitive string comparison

data CompareOption Source

Options to compare.

Constructors

InputIsFCD	The caller knows that both strings fulfill the `FCD` conditions. If not set, `compare` will `quickCheck` for `FCD` and normalize if necessary.
CompareIgnoreCase	Compare strings case-insensitively using case folding, instead of case-sensitively. If set, then the following case folding options are used.
FoldCaseExcludeSpecialI	When case folding, exclude the special I character. For use with Turkic (Turkish/Azerbaijani) text data.

Instances

Enum CompareOption
Eq CompareOption
Show CompareOption
Typeable CompareOption

compare :: [CompareOption] -> Text -> Text -> Ordering Source

Compare two strings for canonical equivalence. Further options include case-insensitive comparison and code point order (as opposed to code unit order).

Canonical equivalence between two strings is defined as their normalized forms (NFD or NFC) being identical. This function compares strings incrementally instead of normalizing (and optionally case-folding) both strings entirely, improving performance significantly.

Bulk normalization is only necessary if the strings do not fulfill the FCD conditions. Only in this case, and only if the strings are relatively long, is memory allocated temporarily. For FCD strings and short non-FCD strings there is no memory allocation.

Locale-sensitive string collation

For the impure collation API (which is richer, but less easy to use than the pure API), see the Data.Text.ICU.Collate module.

data Collator Source

String collator type. Collators are considered equal if they will sort strings identically.

Instances

Eq Collator
Typeable Collator

collator :: LocaleName -> Collator Source

Create an immutable Collator for comparing strings.

If Root is passed as the locale, UCA collation rules will be used.

collate :: Collator -> Text -> Text -> Ordering Source

Compare two strings.

collateIter :: Collator -> CharIterator -> CharIterator -> Ordering Source

Compare two CharIterators.

If either iterator was constructed from a ByteString, it does not need to be copied or converted beforehand, so this function can be quite cheap.

sortKey :: Collator -> Text -> ByteString Source

Create a key for sorting the Text using the given Collator. The result of comparing two ByteStrings that have been transformed with sortKey will be the same as the result of collate on the two untransformed Texts.

uca :: Collator Source

A Collator that uses the Unicode Collation Algorithm (UCA).

Regular expressions

data MatchOption Source

Options for controlling matching behaviour.

Constructors

CaseInsensitive	Enable case insensitive matching.
Comments	Allow comments and white space within patterns.
DotAll	If set, `'.'` matches line terminators. Otherwise `'.'` matching stops at line end.
Literal	If set, treat the entire pattern as a literal string. Metacharacters or escape sequences in the input sequence will be given no special meaning. The option `CaseInsensitive` retains its meanings on matching when used in conjunction with this option. Other options become superfluous.
Multiline	Control behaviour of `'$'` and `'^'`. If set, recognize line terminators within string, Otherwise, match only at start and end of input string.
HaskellLines	Haskell-only line endings. When this mode is enabled, only `'\n'` is recognized as a line ending in the behavior of `'.'`, `'^'`, and `'$'`.
UnicodeWord	Unicode word boundaries. If set, `'\\b'` uses the Unicode TR 29 definition of word boundaries. Warning: Unicode word boundaries are quite different from traditional regular expression word boundaries. See http://unicode.org/reports/tr29/#Word_Boundaries.
ErrorOnUnknownEscapes	Throw an error on unrecognized backslash escapes. If set, fail with an error on patterns that contain backslash-escaped ASCII letters without a known special meaning. If this flag is not set, these escaped letters represent themselves.
WorkLimit Int	Set a processing limit for match operations. Some patterns, when matching certain strings, can run in exponential time. For practical purposes, the match operation may appear to be in an infinite loop. When a limit is set a match operation will fail with an error if the limit is exceeded. The units of the limit are steps of the match engine. Correspondence with actual processor time will depend on the speed of the processor and the details of the specific pattern, but will typically be on the order of milliseconds. By default, the matching time is not limited.
StackLimit Int	Set the amount of heap storage avaliable for use by the match backtracking stack. ICU uses a backtracking regular expression engine, with the backtrack stack maintained on the heap. This function sets the limit to the amount of memory that can be used for this purpose. A backtracking stack overflow will result in an error from the match operation that caused it. A limit is desirable because a malicious or poorly designed pattern can use excessive memory, potentially crashing the process. A limit is enabled by default.

Instances

Eq MatchOption
Show MatchOption
Typeable MatchOption

data ParseError Source

Detailed information about parsing errors. Used by ICU parsing engines that parse long rules, patterns, or programs, where the text being parsed is long enough that more information than an ICUError is needed to localize the error.

Instances

Show ParseError
Typeable ParseError
Storable ParseError
Exception ParseError

data Match Source

A match for a regular expression.

Instances

Show Match
Regular Match

data Regex Source

A compiled regular expression.

Regex values are usually constructed using the regex or regex' functions. This type is also an instance of IsString, so if you have the OverloadedStrings language extension enabled, you can construct a Regex by simply writing the pattern in quotes (though this does not allow you to specify any Options).

Instances

Show Regex
IsString Regex
Regular Regex

class Regular r Source

A typeclass for functions common to both Match and Regex types.

Instances

Regular Match
Regular Regex

Construction

regex :: [MatchOption] -> Text -> Regex Source

Compile a regular expression with the given options. This function throws a ParseError if the pattern is invalid, so it is best for use when the pattern is statically known.

regex' :: [MatchOption] -> Text -> Either ParseError Regex Source

Compile a regular expression with the given options. This is safest to use when the pattern is constructed at run time.

Inspection

pattern :: Regular r => r -> Text Source

Return the source form of the pattern used to construct this regular expression or match.

Searching

find :: Regex -> Text -> Maybe Match Source

Find the first match for the regular expression in the given text.

findAll :: Regex -> Text -> [Match]Source

Lazily find all matches for the regular expression in the given text.

Match groups

Capturing groups are numbered starting from zero. Group zero is always the entire matching text. Groups greater than zero contain the text matching each capturing group in a regular expression.

groupCount :: Regular r => r -> Int Source

Return the number of capturing groups in this regular expression or match's pattern.

unfold :: (Int -> Match -> Maybe Text) -> Match -> [Text]Source

A combinator for returning a list of all capturing groups on a Match.

span :: Match -> Text Source

Return the span of text between the end of the previous match and the beginning of the current match.

group :: Int -> Match -> Maybe Text Source

Return the nth capturing group in a match, or Nothing if n is out of bounds.

prefix :: Int -> Match -> Maybe Text Source

Return the prefix of the nth capturing group in a match (the text from the start of the string to the start of the match), or Nothing if n is out of bounds.

suffix :: Int -> Match -> Maybe Text Source

Return the suffix of the nth capturing group in a match (the text from the end of the match to the end of the string), or Nothing if n is out of bounds.